Working with the Super-Network

In this doc we will explore working with the super-network, S, the mother of all networks. (More formally, the largest connected sub-graph in the union of all gene networks from KEGG.)

Load whatever libraries we need:

library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(ggplot2)

Load the version of S constructed by Bryan:

#load("~/Dropbox/Mac/Desktop/Simple_SIF_network_reactome.RData")
load("Simple_SIF_network_reactome.RData")

The network is in a variable called g.subg. It is an 'igraph' We'll rename it to S:

S <- g.subg
rm(g.subg)
#head(S)

The summary of S gives us a few useful bits of info:

summary(S)
## IGRAPH cc0c112 DN-- 11870 347970 -- 
## + attr: name (v/c)

DN means that it is a 'D'irected network in which th e nodes have 'N'ames. There are 11870 nodes and 347970 edges

We will want to take subnetworks with S. Let's explore how to do this using a smaller graph. Here's how to construct a graph from the existing igraph library:

g <- make_graph('Zachary')
plot(g)

Here's how to generate a random graph (taken from https://r.igraph.org/articles/igraph.html).

set.seed(33)
rg1 <- sample_grg(50, 0.2,coords=TRUE)
summary(rg1)
## IGRAPH 6154f9b U--- 50 151 -- Geometric random graph
## + attr: name (g/c), radius (g/n), torus (g/l), x (v/n), y (v/n)
vertex_attr_names(rg1)
## [1] "x" "y"
plot(rg1,frame.width=3)
plot(rg1,frame.width=6,label.cex=0.1,layout=layout_nicely)

That example generates a geometric random graph: n points are chosen randomly and uniformly inside the unit square and pairs of points closer to each other than a predefined distance d are connected by an edge.

It doesn't plot particularly nicely yet.

Manipulating the vertex attributes:

xcoord <- get.vertex.attribute(rg1,'x')
ycoord <- get.vertex.attribute(rg1,'y')
coords <- cbind(xcoord,ycoord)
plot(rg1,vertex.label.cex=0.5,vertex.layout=5*coords, vertex.size=5, vertex.color="blue",margin=0,asp=0.5)

V(rg1)$x   # to access the x-coords
##  [1] 0.01551696 0.03309035 0.13576507 0.19423622 0.22505121 0.23364797
##  [7] 0.23532162 0.25058754 0.27603364 0.31254372 0.32910179 0.39165069
## [13] 0.41177239 0.41792391 0.42444903 0.43712500 0.44299896 0.44594048
## [19] 0.48372887 0.48831925 0.49398192 0.50139953 0.53206409 0.54868278
## [25] 0.55463383 0.56166844 0.56645266 0.57342672 0.59243182 0.59663026
## [31] 0.61428449 0.62576846 0.63907806 0.67708897 0.69098590 0.72352260
## [37] 0.75870834 0.76536072 0.77474887 0.78069204 0.78188794 0.78455681
## [43] 0.80031854 0.84388144 0.90909936 0.91804240 0.94327643 0.96678586
## [49] 0.96966171 0.97731280

Let's aim to construct a subgraph of all vertices within 2 steps of vertex 3. What pre-defined functions might be useful here?

Edges, vertices and entire mx can be accessed as follows: (nice tutorial at https://kateto.net/wp-content/uploads/2016/01/NetSciX_2016_Workshop.pdf)

E(rg1)
## + 151/151 edges from 6154f9b:
##   [1]  1-- 3  1-- 4  3-- 4  3-- 9  4-- 6  4-- 7  4-- 9  4--12  5-- 6  5--11
##  [11]  6-- 7  6-- 9  6--12  6--15  7-- 9  7--12  7--14  8--10  8--13  8--16
##  [21]  9--12  9--14  9--15  9--17 10--13 10--16 10--19 10--20 11--18 11--21
##  [31] 12--14 12--15 12--17 12--22 12--28 13--16 13--19 13--20 13--23 13--25
##  [41] 13--27 14--15 14--17 14--22 14--28 15--17 15--18 15--21 15--22 15--24
##  [51] 15--28 16--19 16--20 16--23 16--25 16--27 16--31 17--18 17--21 17--22
##  [61] 17--28 18--21 18--24 18--29 18--30 19--20 19--23 19--25 19--31 19--32
##  [71] 19--33 19--34 20--23 20--25 20--31 20--32 20--33 20--34 21--24 21--28
##  [81] 21--29 21--30 22--28 22--35 23--25 23--27 23--31 23--32 23--33 23--34
##  [91] 24--28 24--29 24--30 25--31 25--32 25--33 25--34 26--27 27--31 28--35
## + ... omitted several edges
V(rg1)
## + 50/50 vertices, from 6154f9b:
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#net[]
diam <- get_diameter(rg1, directed=F)
diam
## + 11/50 vertices, from 6154f9b:
##  [1]  1  4 12 22 35 37 45 41 31 16  8
plot(rg1, layout=coords, vertex.color="cyan",)

Un-used code that helped me debug (in which I switched to a ring graph...)

rg1 <- make_ring(n=20)
plot(rg1)

It is simple to calculate distances between nodes:

DM <- distances(rg1,v=V(rg1),to=V(rg1))

Create a sub-graph of all vertices within 2 of vertex 3 (carrying over the edges that connect them). I want to keep track of the original vertex IDs, so I do that manually (they are lost otherwise):

sub <- as.numeric(DM[3,]<2.5)
subIDs <- which(sub==1)
V(rg1)$MyID <- 1:vcount(rg1)
rsub1 <- subgraph(rg1,vids=subIDs)
plot(rsub1,vertex.size=1)

plot(rsub1,vertex.size=1,vertex.label=V(rsub1)$MyID)

Let's try it on the super-network S:

DM <- distances(S,v=V(S),to=V(S))
sub <- as.numeric(DM[3,]<1.5)
subIDs <- which(sub==1)
V(S)$MyID <- 1:vcount(S)
rsub1 <- subgraph(S,vids=subIDs)
plot(rsub1,vertex.size=1)

plot(rsub1,vertex.size=1,vertex.label=V(rsub1)$MyID)

There are over 4000 vertices within 2 steps of vertex 3, which is a bit of a problem for plotting! there are 300 nodes that are neighbors of vertex 10.

Let's find a better vertex to work with We look at the degree of nodes to help

LowDegreeVertices <- which(degree(S)<2)
(length(LowDegreeVertices))
## [1] 810
sub <- as.numeric(DM[LowDegreeVertices[1],]<4.5)
subIDs <- which(sub==1)
V(S)$MyID <- 1:vcount(S)
rsub1 <- subgraph(S,vids=subIDs)
plot(rsub1,vertex.size=1)

plot(rsub1,vertex.size=1,vertex.label=V(rsub1)$MyID,vertex.label.cex=0.5,edge.arrow.size=0.1)

plot(rsub1,vertex.size=1,vertex.label=V(rsub1)$names,vertex.label.cex=0.5,edge.arrow.size=0.1)

Let's write a function to pull out a subgraph of all vertices within 'NeighborDist' of vertex FocalVertex (the index of the focal vertex) on graph 'Network'

NeighborSubgraph <- function(Network,FocalVertex,NeighborDist)
{
  DM <- distances(Network,v=V(Network),to=V(Network))
  sub <- as.numeric(DM[FocalVertex,] <= NeighborDist)
  subIDs <- which(sub==1)
  V(Network)$MyID <- 1:vcount(Network)
  ThisSub <- subgraph(Network,vids=subIDs)
  return (ThisSub)
}

And a function to pick low degree vertices, which might be good focal vertices:

LowDegreeVertices <- function(Network,MaxDegree)
{
  IDs <- which(degree(Network)<=MaxDegree)
  #IDs <- which(LDVs==1)
  return (IDs)
}

Test it:

S2 <- make_ring(20)
set.seed(33)
S3 <- sample_grg(50, 0.2,coords=TRUE)
plot(S3)

x <- components(S3)
cat("Number of connected components: ",length(x$csize),"\n")
## Number of connected components:  2
LowDegVerts <- LowDegreeVertices(S3,3)
(LowDegVerts)
## [1]  1  2  3  5  8 11 26 43 45
SubNet <- NeighborSubgraph(S3,LowDegVerts[1],2)
plot(SubNet,vertex.label=V(SubNet)$MyID,vertex.label.cex=0.8,edge.arrow.size=0.3)

SubNet <- NeighborSubgraph(S3,LowDegVerts[1],3)
plot(SubNet,vertex.label=V(SubNet)$MyID,vertex.label.cex=0.8,edge.arrow.size=0.3)

SubNet <- NeighborSubgraph(S3,LowDegVerts[1],4)
plot(SubNet,vertex.label=V(SubNet)$MyID,vertex.label.cex=0.8,edge.arrow.size=0.3)

SubNet <- NeighborSubgraph(S3,LowDegVerts[9],4)
plot(SubNet,vertex.label=V(SubNet)$MyID,vertex.label.cex=0.8,edge.arrow.size=0.3)

Next, let's try defining a subgraph in a different way. We will start with one vertex and then iterate the following untit the graph is big enough 1. Pick a vertex in the subgraph uniformly at random 2. Find a neighbor of that vertex: add it if it is not already in the graph; else go back to 1.

Here's some code suggested by Copilot (I gave it the first two lines)

g <- sample_grg(50, 0.2,coords=TRUE)
subg <- NA
plot(g)

# pick a random vertex  (I gave this to Copilot, it suggested the following 7 lines)
v <- sample(1:vcount(g), 1)
# get the shortest paths from v to all other vertices
sp <- get.shortest.paths(g, v, mode="out")
# get the indices of the vertices that are at distance 1 from v
v2 <- which(sp[[1]]==1)
# sample one of those
v2 <- sample(v2, 1)

# add it to subg
subg <- c(subg, v2)

That doesn't quite work, but suggests this:

set.seed(33)
g <- sample_grg(50, 0.2,coords=TRUE)
subg <- NA
plot(g)

v <- sample(1:vcount(g), 1)
# get the neighbors of v 
v1 <- neighbors(g, v, mode="all")

# sample one of those
v2 <- sample(v1, 1)

# add it to subg
subg <- c(subg, v2)
# Now wrap this up in a function (given this comment, copilot suggests the following)
BuildSubGraph(g, v, subg) <- function(g, v, subg) {
  # pick a random vertex
  v <- sample(1:vcount(g), 1)
  # get the shortest paths from v to all other vertices
  v1 <- neighbors(g, v, mode="all")

  # sample one of those
  v2 <- sample(v1, 1)

  # sample one of those
  v2 <- sample(v2, 1) 
  
  # add it to subg
  subg <- c(subg, v2)
  
  return(subg)
} 

That's not quite right, but not far away from this, which works:

BuildSubGraph <- function(g, subgsize) {
  size <- 0
  # label the vertices with our own IDs, that can be copied over
  V(g)$MyID <- 1:vcount(g)
  
  # pick a random vertex to start with (this may go wrong if the graph isn't fully-connected)
  subg <- sample(1:vcount(g), 1)
  size <- length(subg)
  itcount<-1
  cat("\nIteration: ",itcount,"  Subgraph size=",size,"  vertices: ",subg)
  while  ((size < subgsize)&&(itcount<1e6))
  {
    itcount <- itcount+1
    # pick a random vertex in the graph currently
    if (length(subg)==1){
      v<- subg
    }
    else
    {
      v <- sample(subg,1)
    }
    # get the neighbors of v
    v1 <- neighbors(g, V(g)[v], mode="all")
    # sample one of those
    if (length(v1)>0){
      if (length(v1)==1){
        v2 <- v1
      }else{
         v2 <- sample(v1, 1)
      }
      # add it to subg
      if (!(v2 %in% subg)){
        cat("\nFocal vertex: ",v,"  nbrs: ",v1,"  sampled: ",v2)
        subg <- c(subg, v2)
      } 
      size <- length(subg)
    }
    cat("\nIteration: ",itcount,"  Subgraph size=",size,"  vertices: ",subg)
    
    # debugging:
    compcount <- count_components(subgraph(g,vids=subg))
    if (compcount>1){
      cat("\n***Error. Disconnected subgraph")
      tempsub <- subgraph(g,vids=subg)
      #plot(tempsub,vertex.label=V(tempsub)$MyID,vertex.label.cex=0.8,edge.arrow.size=0.3)
     return(tempsub)
    }
  }
  Sub <- subgraph(g,vids=subg)
  if (size==subgsize){
    return(Sub)
  }else{
    cat("\nSub-graph routine failed: too many iterations")
    return(NA)
  }
}      

Test that function:

set.seed(33)
GraphSize <- 100
ConnectionProb <- 0.2
g <- sample_grg(GraphSize, ConnectionProb,coords=TRUE)
while (count_components(g)>1){
  g <- sample_grg(GraphSize, ConnectionProb,coords=TRUE)
}
plot(g,vertex.label=V(g)$MyID)

SubGraphSize <- 10
for (i in 1:5){
  NewSub <- BuildSubGraph(g,SubGraphSize)
  plot(NewSub,vertex.label=V(NewSub)$MyID,vertex.label.cex=0.8,edge.arrow.size=0.3)
  cat("\n#components: ",count_components(NewSub))
  if (count_components(NewSub)>1){i<-5}
}
## 
## Iteration:  1   Subgraph size= 1   vertices:  39
## Focal vertex:  39   nbrs:  21 24 29 30 43 47 48 49 50 53 57 61   sampled:  29
## Iteration:  2   Subgraph size= 2   vertices:  39 29
## Focal vertex:  39   nbrs:  21 24 29 30 43 47 48 49 50 53 57 61   sampled:  61
## Iteration:  3   Subgraph size= 3   vertices:  39 29 61
## Iteration:  4   Subgraph size= 3   vertices:  39 29 61
## Focal vertex:  39   nbrs:  21 24 29 30 43 47 48 49 50 53 57 61   sampled:  49
## Iteration:  5   Subgraph size= 4   vertices:  39 29 61 49
## Focal vertex:  61   nbrs:  39 43 47 48 49 50 51 53 57 58 66 68 75 76 78   sampled:  47
## Iteration:  6   Subgraph size= 5   vertices:  39 29 61 49 47
## Iteration:  7   Subgraph size= 5   vertices:  39 29 61 49 47
## Focal vertex:  61   nbrs:  39 43 47 48 49 50 51 53 57 58 66 68 75 76 78   sampled:  57
## Iteration:  8   Subgraph size= 6   vertices:  39 29 61 49 47 57
## Focal vertex:  57   nbrs:  39 43 47 48 49 50 51 53 58 61 66 68   sampled:  43
## Iteration:  9   Subgraph size= 7   vertices:  39 29 61 49 47 57 43
## Focal vertex:  29   nbrs:  11 18 21 24 30 39 43 47 49   sampled:  11
## Iteration:  10   Subgraph size= 8   vertices:  39 29 61 49 47 57 43 11
## Focal vertex:  11   nbrs:  5 18 21 24 29 30   sampled:  5
## Iteration:  11   Subgraph size= 9   vertices:  39 29 61 49 47 57 43 11 5
## Iteration:  12   Subgraph size= 9   vertices:  39 29 61 49 47 57 43 11 5
## Iteration:  13   Subgraph size= 9   vertices:  39 29 61 49 47 57 43 11 5
## Iteration:  14   Subgraph size= 9   vertices:  39 29 61 49 47 57 43 11 5
## Iteration:  15   Subgraph size= 9   vertices:  39 29 61 49 47 57 43 11 5
## Focal vertex:  5   nbrs:  6 11 15 18 21   sampled:  18
## Iteration:  16   Subgraph size= 10   vertices:  39 29 61 49 47 57 43 11 5 18

## 
## #components:  1
## Iteration:  1   Subgraph size= 1   vertices:  96
## Focal vertex:  96   nbrs:  83 84 85 88 92 93 94 95 97 98   sampled:  85
## Iteration:  2   Subgraph size= 2   vertices:  96 85
## Focal vertex:  96   nbrs:  83 84 85 88 92 93 94 95 97 98   sampled:  83
## Iteration:  3   Subgraph size= 3   vertices:  96 85 83
## Iteration:  4   Subgraph size= 3   vertices:  96 85 83
## Focal vertex:  85   nbrs:  73 77 79 83 84 88 92 93 94 95 96 98   sampled:  79
## Iteration:  5   Subgraph size= 4   vertices:  96 85 83 79
## Focal vertex:  85   nbrs:  73 77 79 83 84 88 92 93 94 95 96 98   sampled:  92
## Iteration:  6   Subgraph size= 5   vertices:  96 85 83 79 92
## Focal vertex:  96   nbrs:  83 84 85 88 92 93 94 95 97 98   sampled:  88
## Iteration:  7   Subgraph size= 6   vertices:  96 85 83 79 92 88
## Focal vertex:  96   nbrs:  83 84 85 88 92 93 94 95 97 98   sampled:  93
## Iteration:  8   Subgraph size= 7   vertices:  96 85 83 79 92 88 93
## Focal vertex:  92   nbrs:  83 84 85 89 93 94 95 96 97 98   sampled:  97
## Iteration:  9   Subgraph size= 8   vertices:  96 85 83 79 92 88 93 97
## Focal vertex:  88   nbrs:  77 79 85 93 94 95 96 98 99 100   sampled:  77
## Iteration:  10   Subgraph size= 9   vertices:  96 85 83 79 92 88 93 97 77
## Iteration:  11   Subgraph size= 9   vertices:  96 85 83 79 92 88 93 97 77
## Focal vertex:  92   nbrs:  83 84 85 89 93 94 95 96 97 98   sampled:  95
## Iteration:  12   Subgraph size= 10   vertices:  96 85 83 79 92 88 93 97 77 95

## 
## #components:  1
## Iteration:  1   Subgraph size= 1   vertices:  13
## Focal vertex:  13   nbrs:  2 8 10 16 19 20 23 25 27 31 32   sampled:  2
## Iteration:  2   Subgraph size= 2   vertices:  13 2
## Iteration:  3   Subgraph size= 2   vertices:  13 2
## Focal vertex:  13   nbrs:  2 8 10 16 19 20 23 25 27 31 32   sampled:  32
## Iteration:  4   Subgraph size= 3   vertices:  13 2 32
## Focal vertex:  13   nbrs:  2 8 10 16 19 20 23 25 27 31 32   sampled:  27
## Iteration:  5   Subgraph size= 4   vertices:  13 2 32 27
## Focal vertex:  2   nbrs:  8 10 13   sampled:  8
## Iteration:  6   Subgraph size= 5   vertices:  13 2 32 27 8
## Focal vertex:  32   nbrs:  13 16 19 20 23 25 31 33 34 38 41   sampled:  25
## Iteration:  7   Subgraph size= 6   vertices:  13 2 32 27 8 25
## Focal vertex:  8   nbrs:  2 10 13 16 19 20 23   sampled:  16
## Iteration:  8   Subgraph size= 7   vertices:  13 2 32 27 8 25 16
## Focal vertex:  25   nbrs:  10 13 16 19 20 23 31 32 33 34 38   sampled:  10
## Iteration:  9   Subgraph size= 8   vertices:  13 2 32 27 8 25 16 10
## Iteration:  10   Subgraph size= 8   vertices:  13 2 32 27 8 25 16 10
## Iteration:  11   Subgraph size= 8   vertices:  13 2 32 27 8 25 16 10
## Focal vertex:  8   nbrs:  2 10 13 16 19 20 23   sampled:  20
## Iteration:  12   Subgraph size= 9   vertices:  13 2 32 27 8 25 16 10 20
## Iteration:  13   Subgraph size= 9   vertices:  13 2 32 27 8 25 16 10 20
## Iteration:  14   Subgraph size= 9   vertices:  13 2 32 27 8 25 16 10 20
## Iteration:  15   Subgraph size= 9   vertices:  13 2 32 27 8 25 16 10 20
## Focal vertex:  13   nbrs:  2 8 10 16 19 20 23 25 27 31 32   sampled:  19
## Iteration:  16   Subgraph size= 10   vertices:  13 2 32 27 8 25 16 10 20 19

## 
## #components:  1
## Iteration:  1   Subgraph size= 1   vertices:  69
## Focal vertex:  69   nbrs:  52 56 60 63 73 79 83 84   sampled:  84
## Iteration:  2   Subgraph size= 2   vertices:  69 84
## Iteration:  3   Subgraph size= 2   vertices:  69 84
## Iteration:  4   Subgraph size= 2   vertices:  69 84
## Focal vertex:  69   nbrs:  52 56 60 63 73 79 83 84   sampled:  60
## Iteration:  5   Subgraph size= 3   vertices:  69 84 60
## Focal vertex:  84   nbrs:  69 73 79 83 85 92 93 94 95 96 97 98   sampled:  93
## Iteration:  6   Subgraph size= 4   vertices:  69 84 60 93
## Iteration:  7   Subgraph size= 4   vertices:  69 84 60 93
## Focal vertex:  69   nbrs:  52 56 60 63 73 79 83 84   sampled:  73
## Iteration:  8   Subgraph size= 5   vertices:  69 84 60 93 73
## Focal vertex:  60   nbrs:  41 45 54 55 56 65 69 70 71 72   sampled:  70
## Iteration:  9   Subgraph size= 6   vertices:  69 84 60 93 73 70
## Focal vertex:  70   nbrs:  54 55 59 60 64 65 67 71 72 74 77 79 81   sampled:  65
## Iteration:  10   Subgraph size= 7   vertices:  69 84 60 93 73 70 65
## Iteration:  11   Subgraph size= 7   vertices:  69 84 60 93 73 70 65
## Focal vertex:  84   nbrs:  69 73 79 83 85 92 93 94 95 96 97 98   sampled:  98
## Iteration:  12   Subgraph size= 8   vertices:  69 84 60 93 73 70 65 98
## Focal vertex:  65   nbrs:  54 55 59 60 64 67 70 71 72 74 81   sampled:  55
## Iteration:  13   Subgraph size= 9   vertices:  69 84 60 93 73 70 65 98 55
## Focal vertex:  98   nbrs:  84 85 88 92 93 94 95 96 97 100   sampled:  85
## Iteration:  14   Subgraph size= 10   vertices:  69 84 60 93 73 70 65 98 55 85

## 
## #components:  1
## Iteration:  1   Subgraph size= 1   vertices:  30
## Focal vertex:  30   nbrs:  11 18 21 24 29 39 43 47 49   sampled:  43
## Iteration:  2   Subgraph size= 2   vertices:  30 43
## Focal vertex:  43   nbrs:  24 29 30 39 47 48 49 50 51 53 57 61   sampled:  57
## Iteration:  3   Subgraph size= 3   vertices:  30 43 57
## Focal vertex:  30   nbrs:  11 18 21 24 29 39 43 47 49   sampled:  11
## Iteration:  4   Subgraph size= 4   vertices:  30 43 57 11
## Focal vertex:  57   nbrs:  39 43 47 48 49 50 51 53 58 61 66 68   sampled:  50
## Iteration:  5   Subgraph size= 5   vertices:  30 43 57 11 50
## Focal vertex:  57   nbrs:  39 43 47 48 49 50 51 53 58 61 66 68   sampled:  58
## Iteration:  6   Subgraph size= 6   vertices:  30 43 57 11 50 58
## Focal vertex:  50   nbrs:  36 39 40 43 46 47 48 49 51 53 57 58 61 62 66 68   sampled:  51
## Iteration:  7   Subgraph size= 7   vertices:  30 43 57 11 50 58 51
## Iteration:  8   Subgraph size= 7   vertices:  30 43 57 11 50 58 51
## Iteration:  9   Subgraph size= 7   vertices:  30 43 57 11 50 58 51
## Focal vertex:  30   nbrs:  11 18 21 24 29 39 43 47 49   sampled:  24
## Iteration:  10   Subgraph size= 8   vertices:  30 43 57 11 50 58 51 24
## Iteration:  11   Subgraph size= 8   vertices:  30 43 57 11 50 58 51 24
## Focal vertex:  57   nbrs:  39 43 47 48 49 50 51 53 58 61 66 68   sampled:  39
## Iteration:  12   Subgraph size= 9   vertices:  30 43 57 11 50 58 51 24 39
## Focal vertex:  51   nbrs:  36 40 42 43 44 46 48 49 50 53 57 58 61 62 66 68   sampled:  48
## Iteration:  13   Subgraph size= 10   vertices:  30 43 57 11 50 58 51 24 39 48

## 
## #components:  1

these all look good at first impression, and appear to be connected. But We should test the above more carefully to make sure it is really working.

Here's a smaller network to practise with. It comes from Bryan Queme. It's based on a single pathway and is stored as an igraph called "alz:

load("Interconversion_of_nucleotide_di_and_triphosphates.RData")
summary(alz)
## IGRAPH 2be4967 DN-- 511 3789 -- 
## + attr: name (v/c)
plot(alz,vertex.label.cex=0.2,edge.arrow.size=0.3)

It contains 551 vertices and 3789 edges. It's hard to see, so let's calculate some summary statistics:

get_diameter(alz, directed=F)
## + 8/511 vertices, named, from 2be4967:
## [1] NME2P1      CHEBI:16862 NME1        CHEBI:17677 CDS1        CHEBI:18420
## [7] DCTPP1      CHEBI:86351
deg_alz <- degree(alz)
# two ways:
# simple way
hist(deg_alz,
     xlab = "degree",
     ylab = "Frequency",
     main = "Histogram of alz node degrees, without adjusting breaks and ylim",
     col = "skyblue")

# a bit more granularity
alz_table <- table(deg_alz)
relafreq <- alz_table/sum(alz_table)
barplot(relafreq, xlab = "degree",
        ylab = "Relative frequencies",
        main = "Degree distribution of alz",
        ylim = range(pretty(c(0,relafreq))),
        col = "orange")

Now let's begin to implement some of this.

Preliminary Data!

Existing R packages for pathway handling and visualization

First of all, there are a wide variety of packages out there for handling networks.

Example 1: pathview (https://bioconductor.org/packages/release/bioc/html/pathview.html)

Vignettes are available:

library(pathview)
## 
## ##############################################################################
## Pathview is an open source software package distributed under GNU General
## Public License version 3 (GPLv3). Details of GPLv3 is available at
## http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
## formally cite the original Pathview paper (not just mention it) in publications
## or products. For details, do citation("pathview") within R.
## 
## The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
## license agreement (details at http://www.kegg.jp/kegg/legal.html).
## ##############################################################################
#browseVignettes("pathview")

Downloading KEGG data:

kd <- download.kegg(pathway.id = "00010", species = "hsa", kegg.dir = ".",
file.type=c("xml", "png"))
## Info: Downloading xml files for hsa00010, 1/1 pathways..
## Info: Downloading png files for hsa00010, 1/1 pathways..

The components of downloaded pathway are as follows:

xml.file=system.file("extdata", "hsa04110.xml", package = "pathview")
node.data=node.info(xml.file)
names(node.data)
##  [1] "kegg.names" "type"       "component"  "size"       "labels"    
##  [6] "shape"      "x"          "y"          "width"      "height"
#or parse into a graph object, then extract node info
gR1=pathview:::parseKGML2Graph2(xml.file, genesOnly=FALSE, expand=FALSE, split.group=FALSE)
node.data=node.info(gR1)
(gR1)
## A graphNEL graph with directed edges
## Number of Nodes = 115 
## Number of Edges = 79

Here's another existing library GraphNEL is the graph representation class (https://www.rdocumentation.org/packages/graph/versions/1.50.0/topics/graphNEL-class)

if(!require('graph')) {
  install.packages('graph')
  library('graph')
}

Let's look at the downloaded pathway So for our example, gR1, we have

#edges(gR1)
plot(gR1)

xml.file=system.file("extdata", "hsa04110.xml", package = "pathview")
gR1=pathview:::parseKGML2Graph2(xml.file, genesOnly=FALSE, expand=FALSE, split.group=FALSE)
#pathview(gR1)
#pv.out <- pathview(gene.data = gse16873.d[, 1], pathway.id = demo.paths$sel.paths[i], species = "hsa", out.suffix = "gse16873", kegg.native = F, sign.pos = demo.paths$spos[i])

Networks are stored as graph networks (collections of "edges" and "nodes"). It's easy enough to create our own graphs, as may be needed for methods testing. Here's an example of a directed network:

Nodes <- 6
set.seed(125)
Verts <- LETTERS[1:Nodes]
edL2 <- vector("list", length=Nodes)
names(edL2) <- Verts
MyWeights <- rep(1,2)
for(i in 1:Nodes){
  Connecteds <-sample(Verts,size=2,replace=FALSE)
  edL2[[i]] <- list(edges=Connecteds,weights=MyWeights)
    #edL[[i]] <- list(sample(c(0,1),8,replace=TRUE),weights=(runif(8)))
}
gR2 <- graphNEL(nodes=Verts, edgeL=edL2,edgemode='directed')
edges(gR2)
## $A
## [1] "F" "B"
## 
## $B
## [1] "C" "A"
## 
## $C
## [1] "A" "C"
## 
## $D
## [1] "D" "E"
## 
## $E
## [1] "D" "C"
## 
## $F
## [1] "C" "B"
#edgeWeights(gR)
plot(gR2)


 

Here's another existing library : paxtoolsr. Install via:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("paxtoolsr")

Vignettes found at: https://www.bioconductor.org/packages/devel/bioc/vignettes/paxtoolsr/inst/doc/using_paxtoolsr.html

library(paxtoolsr)
library(rJava)

# Example of merging two files
file1 <- system.file("extdata", "raf_map_kinase_cascade_reactome.owl", package = "paxtoolsr")
file2 <- system.file("extdata", "biopax3-short-metabolic-pathway.owl", package = "paxtoolsr")

mergedFile <- mergeBiopax(file1, file2)

 

From vignette:

Searching Pathway Commons

Networks can also be loaded using Pathway Commons rather than from local BioPAX files. First, we show how Pathway Commons can be searched.

Search Pathway Commons for 'glycolysis'-related pathways:

searchResults <- searchPc(q = "glycolysis", type = "pathway")

Example of visualization:

library(igraph)
sif <- toSif(system.file("extdata", "biopax3-short-metabolic-pathway.owl", package = "paxtoolsr"))

# graph.edgelist requires a matrix
g <- graph.edgelist(as.matrix(sif[, c(1, 3)]), directed = FALSE)
plot(g, layout = layout.fruchterman.reingold)

#print_all(g)

These graphs are "igraphs", from the R library of the same name. We will use iGraph structures for this work.


 

Simple example of Local Moran's I and permutation-based testing

## The legacy packages maptools, rgdal, and rgeos, underpinning this package
## will retire shortly. Please refer to R-spatial evolution reports on
## https://r-spatial.org/r/2023/05/15/evolution4.html for details.
## This package is now running under evolution status 0
## Loading required package: future
## 
## Attaching package: 'future'
## The following objects are masked from 'package:igraph':
## 
##     %->%, %<-%

Globals...

SizeOfOurNetwork <- 20
EdgeProb <- 0.15

We build another random network of 20 genes/nodes

## [1] "igraph"

We can do fancy things like letting the node size vary. Here it varies by degree:

# Compute node degrees (#links) and use that to set node size:
#deg <- degree(net)
sum(E(oldnet)==1)
## [1] 1
V(oldnet)$size <- 20 #deg*3
l <- layout_with_fr(oldnet)
plot(oldnet, layout=l, vertex.color="cyan",)

net <-  sample_gnp(SizeOfOurNetwork, EdgeProb,directed=FALSE)
plot(net)

Edges, vertices and entire mx can be accessed as follows: (nice tutorial at https://kateto.net/wp-content/uploads/2016/01/NetSciX_2016_Workshop.pdf)

E(net)
## + 36/36 edges from ef3326f:
##  [1]  1-- 2  1-- 3  3-- 5  5-- 7  5-- 8  3-- 9  4-- 9  6-- 9  1--10  8--10
## [11]  9--10  5--11  6--11 10--12  2--13  8--13  4--14  7--14 11--14  2--15
## [21]  8--15  1--16 10--16  3--17  7--17 12--17 13--17  5--18 12--18  4--19
## [31]  5--19 11--19  4--20  7--20 10--20 19--20
V(net)
## + 20/20 vertices, from ef3326f:
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
net[]
## 20 x 20 sparse Matrix of class "dgCMatrix"
##                                              
##  [1,] . 1 1 . . . . . . 1 . . . . . 1 . . . .
##  [2,] 1 . . . . . . . . . . . 1 . 1 . . . . .
##  [3,] 1 . . . 1 . . . 1 . . . . . . . 1 . . .
##  [4,] . . . . . . . . 1 . . . . 1 . . . . 1 1
##  [5,] . . 1 . . . 1 1 . . 1 . . . . . . 1 1 .
##  [6,] . . . . . . . . 1 . 1 . . . . . . . . .
##  [7,] . . . . 1 . . . . . . . . 1 . . 1 . . 1
##  [8,] . . . . 1 . . . . 1 . . 1 . 1 . . . . .
##  [9,] . . 1 1 . 1 . . . 1 . . . . . . . . . .
## [10,] 1 . . . . . . 1 1 . . 1 . . . 1 . . . 1
## [11,] . . . . 1 1 . . . . . . . 1 . . . . 1 .
## [12,] . . . . . . . . . 1 . . . . . . 1 1 . .
## [13,] . 1 . . . . . 1 . . . . . . . . 1 . . .
## [14,] . . . 1 . . 1 . . . 1 . . . . . . . . .
## [15,] . 1 . . . . . 1 . . . . . . . . . . . .
## [16,] 1 . . . . . . . . 1 . . . . . . . . . .
## [17,] . . 1 . . . 1 . . . . 1 1 . . . . . . .
## [18,] . . . . 1 . . . . . . 1 . . . . . . . .
## [19,] . . . 1 1 . . . . . 1 . . . . . . . . 1
## [20,] . . . 1 . . 1 . . 1 . . . . . . . . 1 .

Add attributes to the network, vertices, or edges as follows

V(net)$MyAttribute <- runif(length(V(net)),0,1)
#vertex_attr(net)
plot(net, edge.arrow.size=.5, vertex.label.color="black", vertex.label.dist=1.5,
vertex.size=4+10*V(net)$MyAttribute)


 

Taking advantage of what is there already...

There are a variety of potential useful functions for igraphs. For example, a network diameter is the longest geodesic distance (length of the shortest path between two nodes) in the network. In igraph, diameter() returns the distance, while get_diameter() returns the nodes along the first found path of that distance. Note that edge weights are used by default, unless set to NA.

diameter(net, directed=F, weights=NA)
## [1] 4
diameter(net, directed=F)
## [1] 4
diam <- get_diameter(net, directed=F)
diam
## + 5/20 vertices, from ef3326f:
## [1]  1  3  5  7 14
plot(net, vertex.color="cyan",)

It is simple to calculate distances between nodes:

(DM <- distances(net,v=V(net),to=V(net)))
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
##  [1,]    0    1    1    3    2    3    3    2    2     1     3     2     2
##  [2,]    1    0    2    4    3    4    3    2    3     2     4     3     1
##  [3,]    1    2    0    2    1    2    2    2    1     2     2     2     2
##  [4,]    3    4    2    0    2    2    2    3    1     2     2     3     4
##  [5,]    2    3    1    2    0    2    1    1    2     2     1     2     2
##  [6,]    3    4    2    2    2    0    3    3    1     2     1     3     4
##  [7,]    3    3    2    2    1    3    0    2    3     2     2     2     2
##  [8,]    2    2    2    3    1    3    2    0    2     1     2     2     1
##  [9,]    2    3    1    1    2    1    3    2    0     1     2     2     3
## [10,]    1    2    2    2    2    2    2    1    1     0     3     1     2
## [11,]    3    4    2    2    1    1    2    2    2     3     0     3     3
## [12,]    2    3    2    3    2    3    2    2    2     1     3     0     2
## [13,]    2    1    2    4    2    4    2    1    3     2     3     2     0
## [14,]    4    4    3    1    2    2    1    3    2     3     1     3     3
## [15,]    2    1    3    4    2    4    3    1    3     2     3     3     2
## [16,]    1    2    2    3    3    3    3    2    2     1     4     2     3
## [17,]    2    2    1    3    2    3    1    2    2     2     3     1     1
## [18,]    3    4    2    3    1    3    2    2    3     2     2     1     3
## [19,]    3    4    2    1    1    2    2    2    2     2     1     3     3
## [20,]    2    3    3    1    2    3    1    2    2     1     2     2     3
##       [,14] [,15] [,16] [,17] [,18] [,19] [,20]
##  [1,]     4     2     1     2     3     3     2
##  [2,]     4     1     2     2     4     4     3
##  [3,]     3     3     2     1     2     2     3
##  [4,]     1     4     3     3     3     1     1
##  [5,]     2     2     3     2     1     1     2
##  [6,]     2     4     3     3     3     2     3
##  [7,]     1     3     3     1     2     2     1
##  [8,]     3     1     2     2     2     2     2
##  [9,]     2     3     2     2     3     2     2
## [10,]     3     2     1     2     2     2     1
## [11,]     1     3     4     3     2     1     2
## [12,]     3     3     2     1     1     3     2
## [13,]     3     2     3     1     3     3     3
## [14,]     0     4     4     2     3     2     2
## [15,]     4     0     3     3     3     3     3
## [16,]     4     3     0     3     3     3     2
## [17,]     2     3     3     0     2     3     2
## [18,]     3     3     3     2     0     2     3
## [19,]     2     3     3     3     2     0     1
## [20,]     2     3     2     2     3     1     0

Write a function to calculate Moran's (global) I. Reminder: Moran's-I is defined as \[ I=\frac{N}{W}\frac{\sum_i \sum_j w_{ij} (x_i-\bar{x})(x_j-\bar{x})}{\sum_i(x_i-\bar{x})^2} \] This function, and what follows, needs to be carefully debugged to make sure it is calculating correctly.

## 
## Morans-I:  -0.005653249     expectation=  -0.05263158

Test it out...


 

Assessing the null distribution for Moran-s I via permutation tests

How to find immediate neighbors of a vertex (there is probably a built-in igraph function that does this automatically)

## [1]  3  7  8 11 18 19
## [1]  2  3 10 16

Smoothing the labels to make them correlated. Here we use a simple proof of principle scheme in which we generate the labels independently and then smooth them by taking a weighted average of each label and the label of its neighboring vertices. Later on, we will try something more formal.

##  [1] 0.88367943 0.79121789 0.97760836 0.88897901 0.19597838 0.13553089
##  [7] 0.10367339 0.76017829 0.74236245 0.18219717 0.57048192 0.18246939
## [13] 0.02143581 0.38805300 0.63042565 0.26248184 0.40052765 0.51692328
## [19] 0.56437519 0.09822876
##  [1] 0.6634774 0.6235953 0.6962941 0.5951629 0.4856496 0.3959765 0.2150224
##  [8] 0.4250656 0.6115067 0.4117243 0.4041502 0.2929174 0.3989591 0.4678481
## [15] 0.7030619 0.3977101 0.3477070 0.3530736 0.4804031 0.3226137

Test case: Global Moran's \(I\)

Let's compare Moran's-I for random labels and smoother labels...

## Warning in geom_dotplot(binaxis = "y", stackdir = "center", dotsize = 0.5, :
## Ignoring unknown parameters: `bin.width`
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.


 

Correlated node labels

And now for a more formal addition of spatial correlation, using correlated normals, where the degree of correlation depends upon the network structure (as originally proposed by George VY). So, we generate data \(y = (I_n - \rho W)^{-1} \times \epsilon\), where \(\epsilon \sim MVN(0,I_n)\). Since \(\epsilon \sim MVN(0,I_n)\) the independent version has attrributes with \(Normal(0,1)\) distribution.

Now write some tests for the above, comparing them to models in which the attributes are Normal(0,1).

## 
##  time taken=  0.2936811

That didn't seem to show much difference! So let's try something else. We pick a focal node at random and label it and all it's neighbors with a 1, whereas everything else is labeled rnorm(0,1) Assigning spatially correlated node labels

## 
## FocalNode:  18

## 
## Morans-I:  0.008850422     expectation=  -0.05263158

Still not showing a great deal of difference, although it did for some earlier test cases. Here, the focal node only has one neighbor. It will probably do better on a more connected graph. We should explore that.

Let's look at local measures:

 


Local Moran's-I (LISA) - test cases.

We suppose that each node has some (binary or continuous) annotation \(x_i\), and standardize those values by setting \(z_i=x_i-\bar{x}\). The LISA measure of local clustering for each node, \(i\), is then defined as \(I_i = z_i \sum_{j \in J_i} w_{ij}z_j.\)

Here, \(J_i\) is the set of neighbors of node \(i\) (although the definition can be generalized in an obvious way), \(w_{ij}\) is a weight that is used to characterize the distance between nodes. For example, the weight might measure the number of edges on the shortest path between nodes \(i\) and \(j\).

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

Semi-complete graphs

Let's build a good network on which to explore correlated node labels. We will use a set of complete graphs joined to each other by a single vertex.

Add correlated node labels to such a network

Test...

##  [1]  6.86862115  6.75172495  7.00766151  6.91738958  6.91397752  6.90824666
##  [7]  6.75172878  7.02131473  1.44210819  1.66602710 -0.53215062  2.04493799
## [13]  2.40051988  1.41547453  1.58636621 -0.54179208  0.52792876  1.44338373
## [19]  1.73866488  1.46266670  1.46276489  0.77961551  2.07857276  1.75096767
## [25]  1.35410309  0.82232771  1.02412542  1.30635492  0.46313050  1.47823717
## [31]  0.88671121  1.16960121  1.52602665  3.48803701 -0.19247715  1.11595977
## [37]  1.21784493 -0.09125695  1.24400994  2.54400913

## 
## Moran's-I: -0.000114094

## 
## Permuted Moran's-I: 0.0005563579

Test for spatial structure by looking at the maximum value of the LISA stat across nodes for each graph

That seems to work nicely.

Now try a version in which we permute everything except the focal node when calculating the p-value for that node

First we need a permutation function that keeps node i fixed

##  [1]  2.3407997  1.2966597  1.6204623  1.2539526  1.1018927  0.6652059
##  [7]  0.4986218  0.8254643  2.8090374  0.7698950 -0.1304182  1.2159889
## [13]  2.2322373  2.6093587  1.4015506  0.7270160  0.9638477  0.8496888
## [19]  4.7688104 -0.6524960

##  [1]  2.3407997  0.6652059  2.2322373  2.8090374  0.9638477 -0.6524960
##  [7]  0.8254643  1.2966597  0.4986218  1.2159889  0.7270160  0.8496888
## [13]  2.6093587 -0.1304182  1.6204623  1.4015506  1.1018927  4.7688104
## [19]  0.7698950  1.2539526

## 
##  1
##  2
##  3
##  4
##  5
##  6
##  7
##  8
##  9
##  10
##  11
##  12
##  13
##  14
##  15
##  16
##  17
##  18
##  19
##  20
##  [1] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0


          ## Other methods

There are a variety of other methods out there to detect clustering. There is no need to pursue these now, but they are worth keeping in mind for future reference.


Old notes: Tango's disease clusters (Tango 1999)

This method is written to look for 'regions' that have too many diseased indivs.

As such, it does not really fit into our paradigm, because we have no regions.

We could force it to (possibly?) work if we defined each gene node to be a 'region' in which there was just one potential case.

It extends the original Tango method, in that it allows the size of the locally clustered region to be a parameter that is searched over.


Kulldorf's local scan statistic (Kulldorf 1997)

This method is a local cluster detection algorithm that is implemented in the R package SaTScan.

The basic idea is to model, under the null, the occurrence of 'cases' as, in the simplest case, a Poisson process, and then attempt to detect whether this model, which assumes independence everywhere, fits.

If it does not fit, clustering is assumed to be occurring.

It is then extended to other 'null' measures as well, to account for uneven distributions of potential cases (I think), but I don't think we would need those ideas here.

The method originally tried to detected circular clusters of 'alike' values,but has been extended to non-circular shapes. Ideas like this would presumably be sensible if we used local neighborhoods rather than 'shapes'.

An important characteristic of this method is that when the null hypothesisis rejected it indicates the specific area of the map that causes the rejection.

Again, though, it is a region-based method, so not ideal for networks


Besag and Newell's method (Besag et al. 1991)

This is a local cluster detection method that assumes the landscape is divided into regions.

The paper talks about why the assumptions of Moran's-I might be invalid even when disease risk is \(p\) for everyone and all are independent.

Essentially, if the sizes of the regions varies, as it will, then the variance of the incidence rates will also vary from region to region.

This actually induces spatial correlation in the disease rates even when the disease rate is constant from individual to individual. This won't be a problem in our context, because there is one potential 'case' in each region, so all 'disease' rates have the same distribution.


Arbitrarily shaped multiple spatial cluster detection for case event data (Dematte?? et al. 2007)

This paper proposes a method for spatial cluster detection of case event data.

Here, each case has a unique location in space.

Essentially, they look at a path that includes (just) all cases, and look at whether the average distance between points in this path is unusually low (suggesting the existence of clusters.)

These ideas look like they might, with some thought, generalize to our setting. The network would just impose constraints on how the path is constructed.

The paper also contains a nice (brief) overview of methods, referencing several recently proposed methods.